STARC: Structured Annotations for Reading Comprehension
We present STARC (Structured Annotations for Reading Comprehension), a new
annotation framework for assessing reading comprehension with multiple choice
questions. Our framework introduces a principled structure for the answer
choices and ties them to textual span annotations. The framework is implemented
in OneStopQA, a new high-quality dataset for evaluation and analysis of reading
comprehension in English. We use this dataset to demonstrate that STARC can be
leveraged for a key new application for the development of SAT-like reading
comprehension materials: automatic annotation quality probing via span ablation
experiments. We further show that it enables in-depth analyses and comparisons
between machine and human reading comprehension behavior, including error
distributions and guessing ability. Our experiments also reveal that the
standard multiple choice dataset in NLP, RACE, is limited in its ability to
measure reading comprehension. 47% of its questions can be guessed by machines
without accessing the passage, and 18% are unanimously judged by humans as not
having a unique correct answer. OneStopQA provides an alternative test set for
reading comprehension which alleviates these shortcomings and has a
substantially higher human ceiling performance.
Comment: ACL 2020. The OneStopQA dataset, STARC guidelines and human experiments
data are available at https://github.com/berzak/onestop-q
Reconstructing Native Language Typology from Foreign Language Usage
Linguists and psychologists have long been studying cross-linguistic
transfer, the influence of native language properties on linguistic performance
in a foreign language. In this work we provide empirical evidence for this
process in the form of a strong correlation between language similarities
derived from structural features in English as Second Language (ESL) texts and
equivalent similarities obtained from the typological features of the native
languages. We leverage this finding to recover native language typological
similarity structure directly from ESL text, and perform prediction of
typological features in an unsupervised fashion with respect to the target
languages. Our method achieves 72.2% accuracy on the typology prediction task,
a result that is highly competitive with equivalent methods that rely on
typological resources.
Comment: CoNLL 201
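The comparison described in this abstract, between language similarities derived from ESL text features and similarities derived from typological features, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the feature vectors are toy values, and cosine similarity and Pearson correlation are stand-ins that the abstract does not name. It is not the authors' implementation.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pairwise_similarities(vectors):
    """Similarity for every unordered language pair, in a fixed (sorted) order."""
    langs = sorted(vectors)
    return [cosine(vectors[a], vectors[b]) for a, b in combinations(langs, 2)]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


# Hypothetical toy data: one vector per native language.
# esl_features: structural feature values measured in ESL texts (made up).
# typology_features: binary typological features of the native language (made up).
esl_features = {
    "fr": [0.9, 0.1, 0.3],
    "es": [0.8, 0.2, 0.3],
    "ja": [0.1, 0.9, 0.6],
    "ko": [0.2, 0.8, 0.7],
}
typology_features = {
    "fr": [1, 0, 1, 0],
    "es": [1, 0, 1, 1],
    "ja": [0, 1, 0, 1],
    "ko": [0, 1, 0, 0],
}

esl_sims = pairwise_similarities(esl_features)
typ_sims = pairwise_similarities(typology_features)
r = pearson(esl_sims, typ_sims)
print(f"language pairs: {len(esl_sims)}, correlation: {r:.3f}")
```

Both similarity lists are built over the same sorted language pairs, so the two values at each position describe the same pair and can be correlated directly.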
Predicting Native Language from Gaze
A fundamental question in language learning concerns the role of a speaker's
first language in second language acquisition. We present a novel methodology
for studying this question: analysis of eye-movement patterns in second
language reading of free-form text. Using this methodology, we demonstrate for
the first time that the native language of English learners can be predicted
from their gaze fixations when reading English. We provide analysis of
classifier uncertainty and learned features, which indicates that differences
in English reading are likely to be rooted in linguistic divergences across
native languages. The presented framework complements production studies and
offers new ground for advancing research on multilingualism.
Comment: ACL 201
Second language learning from a multilingual perspective
Thesis: Ph. D., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2018.
This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.
Cataloged from student-submitted PDF version of thesis.
Includes bibliographical references (pages 119-127).
How do people learn a second language? In this thesis, we study this question through an examination of cross-linguistic transfer: the role of a speaker's native language in the acquisition, representation, usage and processing of a second language. We present a computational framework that enables studying transfer in a unified fashion across language production and language comprehension. Our framework supports bidirectional inference between linguistic characteristics of speakers' native languages, and the way they use and process a new language. We leverage this inference ability to demonstrate the systematic nature of cross-linguistic transfer, and to uncover some of its key linguistic and cognitive manifestations. We instantiate our framework in language production by relating syntactic usage patterns and grammatical errors in English as a Second Language (ESL) to typological properties of the native language, showing its utility for automated typology learning and prediction of second language grammatical errors. We then introduce eye tracking during reading as a methodology for studying cross-linguistic transfer in second language comprehension. Using this methodology, we demonstrate that learners' native language can be predicted from their eye movement while reading free-form second language text. Further, we show that language processing during second language comprehension is intimately related to linguistic characteristics of the reader's first language. Finally, we introduce the Treebank of Learner English (TLE), the first syntactically annotated corpus of learner English. The TLE is annotated with Universal Dependencies (UD), a framework geared towards multilingual language analysis, and will support linguistic and computational research on learner language. Taken together, our results highlight the importance of multilingual approaches to the scientific study of second language acquisition, and to Natural Language Processing (NLP) applications for non-native language.
by Yevgeni Berzak.
Bridging Information-Seeking Human Gaze and Machine Reading Comprehension
In this work, we analyze how human gaze during reading comprehension is
conditioned on the given reading comprehension question, and whether this
signal can be beneficial for machine reading comprehension. To this end, we
collect a new eye-tracking dataset with a large number of participants engaging
in a multiple choice reading comprehension task. Our analysis of this data
reveals increased fixation times over parts of the text that are most relevant
for answering the question. Motivated by this finding, we propose making
automated reading comprehension more human-like by mimicking human
information-seeking reading behavior during reading comprehension. We
demonstrate that this approach leads to performance gains on multiple choice
question answering in English for a state-of-the-art reading comprehension
model.
Contrastive Analysis with Predictive Power: Typology Driven Estimation of Grammatical Error Distributions in ESL
This work examines the impact of crosslinguistic transfer on grammatical errors in English as Second Language (ESL) texts. Using a computational framework that formalizes the theory of Contrastive Analysis (CA), we demonstrate that language-specific error distributions in ESL writing can be predicted from the typological properties of the native language and their relation to the typology of English. Our typology-driven model makes it possible to obtain accurate estimates of such distributions without access to any ESL data for the target languages. Furthermore, we present a strategy for adjusting our method to low-resource languages that lack typological documentation using a bootstrapping approach which approximates native language typology from ESL texts. Finally, we show that our framework is instrumental for linguistic inquiry seeking to identify first language factors that contribute to a wide range of difficulties in second language acquisition.
This work was supported by the Center for Brains, Minds and Machines (CBMM), funded by NSF STC award CCF-1231216.
Do You See What I Mean? Visual Resolution of Linguistic Ambiguities
Understanding language goes hand in hand with the ability to integrate
complex contextual information obtained via perception. In this work, we
present a novel task for grounded language understanding: disambiguating a
sentence given a visual scene which depicts one of the possible interpretations
of that sentence. To this end, we introduce a new multimodal corpus containing
ambiguous sentences, representing a wide range of syntactic, semantic and
discourse ambiguities, coupled with videos that visualize the different
interpretations for each sentence. We address this task by extending a vision
model which determines if a sentence is depicted by a video. We demonstrate how
such a model can be adjusted to recognize different interpretations of the same
underlying sentence, which allows sentences to be disambiguated in a unified
fashion across the different ambiguity types.
Comment: EMNLP 201
Modeling Language Variation and Universals: A Survey on Typological Linguistics for Natural Language Processing
Linguistic typology aims to capture structural and semantic variation across
the world's languages. A large-scale typology could provide excellent guidance
for multilingual Natural Language Processing (NLP), particularly for languages
that suffer from the lack of human labeled resources. We present an extensive
literature survey on the use of typological information in the development of
NLP techniques. Our survey demonstrates that to date, the use of information in
existing typological databases has resulted in consistent but modest
improvements in system performance. We show that this is due to both intrinsic
limitations of databases (in terms of coverage and feature granularity) and
under-employment of the typological features included in them. We advocate for
a new approach that adapts the broad and discrete nature of typological
categories to the contextual and continuous nature of machine learning
algorithms used in contemporary NLP. In particular, we suggest that such an
approach could be facilitated by recent developments in data-driven induction
of typological knowledge.
Universal Dependencies for Learner English
We introduce the Treebank of Learner English (TLE), the first publicly available syntactic treebank for English as a Second Language (ESL). The TLE provides manually annotated POS tags and Universal Dependency (UD) trees for 5,124 sentences from the Cambridge First Certificate in English (FCE) corpus. The UD annotations are tied to a pre-existing error annotation of the FCE, whereby full syntactic analyses are provided for both the original and error-corrected versions of each sentence. Furthermore, we delineate ESL annotation guidelines that allow for consistent syntactic treatment of ungrammatical English. Finally, we benchmark POS tagging and dependency parsing performance on the TLE dataset and measure the effect of grammatical errors on parsing accuracy. We envision the treebank to support a wide range of linguistic and computational research on second language acquisition as well as automatic processing of ungrammatical language.
This work was supported by the Center for Brains, Minds and Machines (CBMM), funded by NSF STC award CCF-1231216.